This is an R Markdown Notebook for de_contamination analysis in genome assembly projects. Here, I explain how to run two different de-contamination programs (kraken2 and blobtools).
What is kraken2?
Is “a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer”. (https://github.com/DerrickWood/kraken2/wiki/Manual#special-databases)
To run kraken2 you need your genome assembly and the index fai file for the assembly.
# /bin/sh
# ----------------Parameters---------------------- #
#$ -S /bin/sh
#$ -q mThC.q
#$ -pe mthread 12 -l mres=48G,h_data=4G,h_vmem=4G
#$ -cwd
#$ -j y
#$ -N kraken
#$ -o kraken.log
#$ -m bea
#$ -M ariasc@si.edu
#
# ----------------Modules------------------------- #
module load bioinformatics/kraken
#
# ----------------Your Commands------------------- #
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
kraken2 --db /data/genomics/db/Kraken/kraken2_db/ --report kraken_report_remi --use-names --confidence --threads $NSLOTS /scratch/genomics/ariasc/ariasc_pool/krake2/HPA_04_pilon.fasta
#
echo = `date` job $JOB_NAME done
Kraken Output
Holocanthus passer
First we will get only the list of contigs and its classification
grep "Contig" kraken2.log > kraken2_results.txt
Get list of human and unclassified contigs: By doing this, we are removing anything classified as bacteria, plasmids or viruses. We keep the unclassified because they might contain real contigs.
grep "unclassified\|Homo sapiens" kraken2_results.txt | cut -f2 > assembly.fa.HsU.list
In order to explore kraken resuls, I download the results files and I use R to explore.
module load bio/rclone
rclone copy kraken2_results.txt db3:hydra_backup/
Exploring the results in R.
kraken_results<-read.delim("kraken2_results_remi.txt", header = F)
colnames(kraken_results) <- c("Classification", "contig", "taxon", "length", "kmers")
sub_kraken_results <- filter(kraken_results, taxon !="Homo sapiens (taxid 9606)" & taxon !="unclassified (taxid 0)")
sub_kraken_results %>% kbl() %>% kable_styling()
| Classification | contig | taxon | length | kmers |
|---|---|---|---|---|
| C | ctg307_pilon_pilon | cellular organisms (taxid 131567) | 20487 | 0:8871 228410:3 0:1568 9606:5 0:7534 530564:5 0:2467 |
| C | ctg322_pilon_pilon | Dickeya dadantii (taxid 204038) | 20351 | 0:1233 274537:2 0:4898 204038:1 0:3846 204038:3 0:3226 204038:3 0:2931 204038:1 0:4173 |
| C | ctg350_pilon_pilon | Cellulosilyticum lentocellum DSM 5427 (taxid 642492) | 12701 | 0:169 9606:1 0:11990 642492:4 0:503 |
| C | ctg373_pilon_pilon | Campylobacter lari (taxid 201) | 10068 | 0:1440 201:5 0:8589 |
| C | ctg425_pilon_pilon | Mycolicibacterium rhodesiae NBB3 (taxid 710685) | 6563 | 0:951 710685:5 0:5573 |
| C | ctg429_pilon_pilon | Legionella pneumophila (taxid 446) | 6166 | 0:1766 446:5 0:4361 |
| C | ctg449_pilon_pilon | Bartonella birtlesii IBS 325 (taxid 1095900) | 5579 | 0:3273 1095900:5 0:2267 |
| C | ctg458_pilon_pilon | Hymenobacter sedentarius (taxid 1411621) | 3868 | 0:690 1411621:2 0:3142 |
| C | ctg481_pilon_pilon | cellular organisms (taxid 131567) | 9386 | 0:2913 1484116:5 0:143 9606:5 0:6286 |
| C | ctg480_pilon_pilon | Burkholderia diffusa (taxid 488732) | 5053 | 0:704 488732:5 0:4310 |
#sum(sub_kraken_results$length)
#100222 ~100Kbp
Kraken2 found ten contigs contaminated with bacteria and archea. This 10 contigs made 100kbp.
Using samtools, we will extract from the assembly only the contigs identified as Human or unclassified. Bacteria, viruses and plasmids will be excluded.
module load bioinformatics/samtools
xargs samtools faidx original_assembly.fa < assembly.fa.HsU.list > new_assembly.fa
What is blobtools?
Is “A modular command-line solution for visualisation, quality control and taxonomic partitioning of genome dataset” (https://github.com/DRL/blobtools)
First, you need to blast your assembly to know nt databases. for this we will used blastn program. we will used our job file blast.job to the program in the server.
# /bin/sh
# ----------------Parameters---------------------- #
#$ -S /bin/sh
#$ -pe mthread 12
#$ -q mThC.q
#$ -l mres=96G,h_data=8G,h_vmem=8G
#$ -cwd
#$ -j y
#$ -N acti_blast
#$ -o acti_blast.log
#$ -m bea
#$ -M ariasc@si.edu
#
# ----------------Modules------------------------- #
module load bioinformatics/blast
#
# ----------------Your Commands------------------- #
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
blastn -db /data/genomics/db/ncbi/db/latest_v4/nt/nt -query HPA_04_pilon.fasta -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-20 -num_threads $NSLOTS -out HPA_blast.out
#
echo = `date` job $JOB_NAME done
Second, you need to map raw reads to the genome assembly. We used bwa to map and samtools to convert from sam to ban and finally sort the bam file.
#Index genome assembly
bwa index -p HPA_04_pilon.fasta
#run bwa
bwa mem -t $NSLOTS reads/HPA_04_R1.fastq reads/HPA_04_R2.fastq > HPA_04_pilon_bwa.sam
#comvert to bam and sort with samtools
samtools view -bu -@ $NSLOTS HPA_04_pilon_bwa.sam | samtools sort -T HPA_04 -O bam -@ $NSLOTS -l 9 -m 2G -o HPA_04_pilon_bwa_aligned_sorted.bam
First, you need to create a database with blast and mapping files
# /bin/sh
# ----------------Parameters------------#
#$ -S /bin/sh
#$ -q mThM.q
#$ -l mres=24G,h_data=24G,h_vmem=24G,himem
#$ -cwd
#$ -j y
#$ -N blob_create
#$ -o blob_create.log
#$ -m bea
#$ -M ariasc@si.edu
#
# ----------------Modules------------ #
module load bioinformatics/blobtools
#
# ----------------Your_Commands---- #
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
#
blobtools create \
-i HPA_04_pilon.fasta \
-t HPA_blast.out \
-b HPA_04_pilon_bwa_aligned_sorted.bam \
-o HPA_04_blob
#
echo = `date` job $JOB_NAME done
Second to view/summarize results you need to process the .json file or you can directly plot the results.
#you can used the interactive node in hydra since it do not uses to much memmory or time
qrsh
#to view results in as a table.
blobtools view -i HPA_04_blob.blobDB.json -o ./
#to plot results
blobtools plot -i HPA_04_blob.blobDB.json -o ./
This created two png files with the plots. Look at file in project folder.
knitr::include_graphics('HPA_04_blob.blobDB.json.bestsum.phylum.p7.span.100.blobplot.bam0.png')
knitr::include_graphics('HPA_04_blob.blobDB.json.bestsum.phylum.p7.span.100.blobplot.read_cov.bam0.png')
qrsh -pe mthread 8
module load bio/blobtools/2.6.3
#Create a BlobDir
blobtools create --fasta /scratch/genomics/ariasc/remy/blob/HPA_04_pilon.fasta HPA_04_pilon
#Add BLAST hits
blobtools add --threads 8 --hits /scratch/genomics/ariasc/remy/blob/HPA_blast2.out --taxrule bestsumorder --taxdump /scratch/genomics/ariasc/remy/blob/blopbtools2_run/taxdump HPA_04_pilon
#Add mapping coverage
blobtools add --threads 8 --cov /scratch/genomics/ariasc/remy/blob/HPA_04_pilon_bwa_aligned_sorted.bam HPA_04_pilon
#Download db HPA_04_pilon folder to desktop and run blobtools view
rclone copy HPA_04_pilon db3:hydra_backup/
#Create interactive html page with the blobtools results. this command was run in my personal mac computer were I install blobtools2 using conda.
#activate conda enviroment for blobtools2
conda activate btk_env
./blobtools view --remote /Users/solracarias/Desktop/HPA_04_pilon
You can play with viewer to change colors on figures, tables etc. Figures and tables can be downloaded and visualized.
#
#
knitr::include_graphics('HPA_04_pilon.blob.circle.png')
Blobtools Results
#
knitr::include_graphics('HPA_04_pilon.cumulative.png')
#
knitr::include_graphics('HPA_04_pilon.snail.png')
blob_results<- read.csv("bestsumororder_table.csv",header = T, sep = ",")
blob_results<-select(blob_results, -1)
blob_results_nohits <- blob_results %>% filter(bestsumorder_phylum=="no-hit")
Kraken_blob_results<- merge(blob_results, kraken_results, by.x ="id", by.y = "contig", all.x=F, all.y=F)
filt_kraken_blob_results <- filter(Kraken_blob_results, taxon !="Homo sapiens (taxid 9606)" & taxon !="unclassified (taxid 0)")
filt_kraken_blob_results %>% kbl() %>% kable_styling()
| id | X_id | gc | length.x | bestsumorder_phylum | bestsumorder_class | bestsumorder_order | bestsumorder_family | bestsumorder_genus | bestsumorder_species | HPA_04_pilon_bwa_aligned_sorted_cov | Classification | taxon | length.y | kmers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ctg307_pilon_pilon | 304 | 0.3908 | 20487 | Chordata | Actinopteri | Blenniiformes | Blenniidae | Salarias | Salarias fasciatus | 241.0585 | C | cellular organisms (taxid 131567) | 20487 | 0:8871 228410:3 0:1568 9606:5 0:7534 530564:5 0:2467 |
| ctg322_pilon_pilon | 323 | 0.4026 | 20351 | Chordata | Actinopteri | Spariformes | Sparidae | Sparus | Sparus aurata | 129.5457 | C | Dickeya dadantii (taxid 204038) | 20351 | 0:1233 274537:2 0:4898 204038:1 0:3846 204038:3 0:3226 204038:3 0:2931 204038:1 0:4173 |
| ctg350_pilon_pilon | 352 | 0.4053 | 12701 | Chordata | Actinopteri | Perciformes | Percidae | Lateolabrax | Lateolabrax maculatus | 360.8253 | C | Cellulosilyticum lentocellum DSM 5427 (taxid 642492) | 12701 | 0:169 9606:1 0:11990 642492:4 0:503 |
| ctg373_pilon_pilon | 368 | 0.3631 | 10068 | Chordata | Actinopteri | Actinopteri-undef | Pomacentridae | Amphiprion | Amphiprion ocellaris | 155.7172 | C | Campylobacter lari (taxid 201) | 10068 | 0:1440 201:5 0:8589 |
| ctg425_pilon_pilon | 418 | 0.4435 | 6563 | Chordata | Actinopteri | Spariformes | Sparidae | Sparus | Sparus aurata | 184.2904 | C | Mycolicibacterium rhodesiae NBB3 (taxid 710685) | 6563 | 0:951 710685:5 0:5573 |
| ctg429_pilon_pilon | 430 | 0.3800 | 6166 | Chordata | Actinopteri | Pempheriformes | Lateolabracidae | Lateolabrax | Lateolabrax maculatus | 143.1620 | C | Legionella pneumophila (taxid 446) | 6166 | 0:1766 446:5 0:4361 |
| ctg449_pilon_pilon | 449 | 0.3970 | 5579 | Chordata | Actinopteri | Spariformes | Sparidae | Sparus | Sparus aurata | 74.8993 | C | Bartonella birtlesii IBS 325 (taxid 1095900) | 5579 | 0:3273 1095900:5 0:2267 |
| ctg458_pilon_pilon | 450 | 0.4196 | 3868 | Chordata | Actinopteri | Pempheriformes | Lateolabracidae | Lateolabrax | Lateolabrax maculatus | 0.0000 | C | Hymenobacter sedentarius (taxid 1411621) | 3868 | 0:690 1411621:2 0:3142 |
| ctg480_pilon_pilon | 484 | 0.3725 | 5053 | no-hit | no-hit | no-hit | no-hit | no-hit | no-hit | 2498.0321 | C | Burkholderia diffusa (taxid 488732) | 5053 | 0:704 488732:5 0:4310 |
| ctg481_pilon_pilon | 478 | 0.4563 | 9386 | Chordata | Actinopteri | Holocentriformes | Holocentridae | Myripristis | Myripristis murdjan | 157.8065 | C | cellular organisms (taxid 131567) | 9386 | 0:2913 1484116:5 0:143 9606:5 0:6286 |